2021-03-11
“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.” (Tufte)
Data visualization is all about communication.
Just as in graphic design, less is more. To get a good graphic, remove all excess ink.
Resist the temptation of showing every bit of data. If necessary, put it in the supplementary materials.
p <- mtcars %>% group_by(cyl) %>%
summarise(mean_mpg=mean(mpg)) %>%
mutate(cyl=factor(cyl)) %>%
ggplot(aes(x=cyl, y=mean_mpg, fill=cyl))
p + geom_bar(stat="identity", mapping=aes(fill=cyl)) +
theme(axis.line=element_line(size=1, arrow=arrow(length=unit(0.1, "inches"))))
“Clutter and confusion are failures of design, not attributes of information.” (Tufte)
boxplot(hwy ~ class, data=mpg)
mpg %>% ggplot(aes(x=class, y=hwy)) + geom_boxplot()
mpg %>% ggplot(aes(x=class, y=hwy)) + geom_boxplot() +
geom_dotplot(binaxis="y", stackdir="center", fill="grey", dotsize=.3)
p <- list()
p$p1 <- ggplot(mtcars, aes(x=disp, y=hp, color=factor(cyl))) + geom_point()
p$p2 <- ggplot(mtcars, aes(x=disp, y=hp, color=factor(cyl))) + geom_point() +
theme_par()
p$p3 <- ggplot(mtcars, aes(x=disp, y=hp, color=factor(cyl))) + geom_point() +
theme_cowplot()
p$p4 <- ggplot(mtcars, aes(x=disp, y=hp, color=factor(cyl))) + geom_point() +
theme_tufte()
p <- map(p, ~ . + theme(plot.margin=margin(20, 0, 0, 0)))
plot_grid(plotlist=p, labels=c("Default", "Par", "Cowplot", "Tufte"))
“Above all else show the data.” (Tufte)
(demo)
Editorial. “Kick the bar chart habit.” Nature Methods 11 (2014): 113.
It is not important which system you use. What matters is that you first come up with an idea of how you want the data to be plotted, and that you can then plot it, with whatever means you can. (Where should you look for a lost watch?)
par() system
data(mpg)
ggplot(mpg, aes(x=hwy, y=cty)) + geom_point()
We will now use world inequality data to create a bar plot.
First, we prepare the data using tidyverse.
wid <- read_excel("../Datasets/WIID_19Dec2018.xlsx")
wid <- wid %>% drop_na(gini_reported, q1:q5, d1:d10)
wid2015 <- wid %>% filter(year==2015 &
region_un == "Europe" &
population > 5e6)
wid2015sel <- wid2015 %>%
filter(quality=="High") %>%
filter(!duplicated(country)) %>%
select(country, gini_reported, q1:q5, d1:d10)
## we scramble the order of the quantiles on purpose
data <- wid2015sel %>%
gather(q1:q5, key="quantile", value="proportion") %>%
mutate(quantile=factor(quantile, levels=paste0("q", c(2, 1, 5, 4, 3))))
Now we pass the data to ggplot.
p <- data %>%
  ggplot(aes(country, proportion, fill=quantile)) +
  geom_bar(stat="identity") + coord_flip()
p
coord_flip() makes the bar plot horizontal.
geom_bar() uses the fill aesthetic.
data <- data %>% mutate(quantile=factor(quantile, levels=paste0("q", 5:1)))
p <- data %>%
ggplot(aes(country, proportion, fill=quantile)) +
geom_bar(stat="identity") + coord_flip()
p
data <- wid2015sel %>%
mutate(country=reorder(country, desc(gini_reported))) %>%
gather(q1:q5, key="quantile", value="proportion") %>%
mutate(quantile=factor(quantile, levels=paste0("q", 5:1)))
p <- data %>%
ggplot(aes(country, proportion, fill=quantile)) +
geom_bar(stat="identity") + coord_flip()
p
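The bar order in these plots is controlled entirely by factor levels. A minimal base R sketch with made-up scores (the `country` and `gini` vectors below are hypothetical toy data, not taken from the WIID file) shows how `reorder()` rearranges the levels:

```r
# Toy data, invented for illustration only
country <- c("A", "B", "C")
gini    <- c(30, 25, 35)

# Levels are sorted by increasing score by default
asc <- reorder(country, gini)
levels(asc)    # "B" "A" "C"

# Negating the score gives a descending order,
# similar in spirit to desc() in the pipeline above
dsc <- reorder(country, -gini)
levels(dsc)    # "C" "A" "B"
```

Because ggplot2 draws discrete axes in level order, changing the levels is usually all that is needed to reorder the bars.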
p + theme_tufte() + scale_fill_brewer(palette="Blues") +
ylab("Proportion of wealth") + xlab("Country") +
guides(fill=guide_legend(reverse=TRUE))
There are many ways to represent colors. In R, we most frequently use the RGB scheme, in which each color is described by three values, one for each of the three channels: red, green and blue.
One way is to choose values between 0 and 1; another, between 0 and 255. The latter can be written in hexadecimal notation, in which each value goes from 00 to FF (15 * 16 + 15 = 255). This is a very common notation, also used in HTML:
"#FF0000" or c(255, 0, 0): red channel at its maximum, green and blue at their minimum. The result is red.
"#00FF00": bright green
"#000000": black
"#FFFFFF": white
To get the color from numbers in the 0…1 range:
rgb(0.5, 0.7, 0) # returns “#80B300”
To get the color from numbers in 0…255 range:
rgb(255, 128, 0, maxColorValue=255)
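A short base R sketch of the conversions described here, using only `col2rgb()` and `rgb()` from grDevices plus `strtoi()` and `sprintf()` from base; the variable names are chosen just for illustration:

```r
# Hex pair to decimal and back
strtoi("FF", base = 16L)      # 255
sprintf("%02X", 128L)         # "80"

# Hex string to the three channels (a 3 x 1 integer matrix)
channels <- col2rgb("#FF8000")
channels["red", 1]            # 255
channels["green", 1]          # 128

# Channels back to a hex string
orange <- rgb(255, 128, 0, maxColorValue = 255)
orange                        # "#FF8000"
```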
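Transparency (the alpha channel) is a useful way to handle large numbers of data points. It is encoded in two extra hex digits at the end of the color. #FF000000: fully transparent; #FF0000FF: fully opaque.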
x <- rnorm(10000)
y <- x + rnorm(10000)
p1 <- ggplot(NULL, aes(x=x, y=y)) + geom_point() +
theme_tufte() + theme(plot.margin=unit(c(2,1,1,1), "cm"))
p2 <- ggplot(NULL, aes(x=x, y=y)) + geom_point(color="#6666661F") +
theme_tufte() + theme(plot.margin=unit(c(2,1,1,1),"cm"))
plot_grid(p1, p2, labels=c("Black", "#6666661F"))
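The eight-digit codes carry the opacity in their last two hex digits. A small base R sketch, assuming nothing beyond grDevices, of how to read and write that alpha channel:

```r
# Read the alpha channel of the grey used above: 0x1F = 31 of 255
a <- col2rgb("#6666661F", alpha = TRUE)["alpha", 1]
a            # 31
a / 255      # about 0.12, i.e. roughly 12% opaque

# Build a half-transparent red from values in the 0..1 range
half_red <- rgb(1, 0, 0, alpha = 0.5)
half_red     # "#FF000080"
```

With such a low opacity, a point only looks dark where many points overlap, which is what makes dense regions visible.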
There are several other representations of color space, and they do not give exactly the same results. Two common representations are HSV (Hue, Saturation, Value) and HSL (Hue, Saturation, Lightness).
There are many packages to help you manipulate colors using HSL and HSV. For example, my package plotwidgets allows you to modify colors using the HSL model.
library(plotwidgets)
## Now loop over hues
pal <- plotPals("zeileis")
v <- c(10, 9, 19, 9, 15, 5)
a2xy <- function(a, r=1, full=FALSE) {
t <- pi/2 - 2 * pi * a / 360
list( x=r * cos(t), y=r * sin(t) )
}
plot.new()
par(usr=c(-1,1,-1,1))
hues <- seq(0, 360, by=30)
pos <- a2xy(hues, r=0.75)
for(i in 1:length(hues)) {
cols <- modhueCol(pal, by=hues[i])
wgPlanets(x=pos$x[i], y=pos$y[i], w=0.5, h=0.5, v=v, col=cols)
}
pos <- a2xy(hues[-1], r=0.4)
text(pos$x, pos$y, hues[-1])
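For comparison, a rough base R analogue of rotating the hue, using HSV (via `rgb2hsv()` and `hsv()` from grDevices) rather than the HSL model that plotwidgets uses. `shift_hue()` is a hypothetical helper written for this sketch, not part of any package:

```r
# Hypothetical helper: rotate the hue of colors by `by` degrees
shift_hue <- function(col, by) {
  h <- rgb2hsv(col2rgb(col))            # 3 x n matrix with rows h, s, v
  new_h <- (h["h", ] + by / 360) %% 1   # hue lives on a 0..1 circle
  hsv(new_h, h["s", ], h["v", ])
}

shift_hue("#FF0000", 120)   # red rotated by 120 degrees: "#00FF00"
shift_hue("#FF0000", 360)   # a full turn returns the original red
```

Because HSV and HSL define brightness differently, the results will not match the plotwidgets output exactly for every color.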
It is not easy to get a nice combination of colors (see the default ggplot2 plot for how not to do it).
There are numerous palettes in numerous packages. One of the most popular is RColorBrewer. You can use it with both base R and ggplot2.
library(RColorBrewer)
par(mar=c(0,4,0,0))
display.brewer.all()
par(mar=c(0,4,0,0))
display.brewer.all(colorblindFriendly=TRUE)
data("iris")
The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. Fisher 1936
pal <- brewer.pal(3, "Dark2")
iris$Species <- factor(iris$Species)
cols <- pal[ iris$Species ]
plot(iris$Sepal.Length, iris$Sepal.Width, col=cols, pch=19,
xlab="Sepal length", ylab="Sepal width", bty="n", cex=1.5)
legend("topright", levels(iris$Species), col=pal, pch=19, bty="n")
You can easily use ggplot with RColorBrewer palettes:
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=4) +
scale_color_brewer(palette="Dark2") +
theme_tufte() +
theme(axis.title.y=element_text(margin=margin(0,10,0,0)),
axis.title.x=element_text(margin=margin(10, 0, 0, 0)))
To use the viridis palette in base R, use the following code:
library(scales)
pal <- viridis_pal()(n=6)
show_col(pal)
Implemented in ggplot functions:
scale_(color|fill)_viridis_(c|d)
c for continuous, d for discrete
e.g.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=4) + scale_color_viridis_d() + theme_tufte() +
theme(axis.title.y=element_text(margin=margin(0,10,0,0)),
axis.title.x=element_text(margin=margin(10, 0, 0, 0)))
par(mar=c(0,4,0,0))
library(plotwidgets)
showPals()
Custom palettes can be passed to scale_color_manual:
pal <- plotPals("darkhaze")
pal
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=4) + scale_color_manual(values=pal) + theme_tufte() +
theme(axis.title.y=element_text(margin=margin(0,10,0,0)),
axis.title.x=element_text(margin=margin(10, 0, 0, 0)))
In base R, we can use colorRampPalette() to build a gradient palette:
pal_func <- colorRampPalette(c("cyan", "black", "purple"))
pal <- pal_func(15)
pal
## [1] "#00FFFF" "#00DADA" "#00B6B6" "#009191" "#006D6D" "#004848" "#002424" "#000000" "#160422" "#2D0944" "#440D66" "#5B1289" "#7216AB" "#891BCD" "#A020F0"
In ggplot, there are a number of continuous scales available.
scale_(color|fill)_viridis_c (viridis color scale)
scale_(color|fill)_gradient (two colors)
scale_(color|fill)_gradient2 (three colors)
The problem lies in defining the break points exactly (which value corresponds to which color?).
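One way to make the value-to-color mapping explicit is to define the break points yourself. A minimal base R sketch, assuming a symmetric range of -3 to 3 chosen purely for illustration; `val2col()` is a hypothetical helper:

```r
pal <- colorRampPalette(c("cyan", "black", "purple"))(15)

# 16 break points delimit 15 bins, one per palette color
breaks <- seq(-3, 3, length.out = 16)

# Hypothetical helper: look up the bin of each value;
# all.inside = TRUE clamps values outside the range to the end bins
val2col <- function(x) pal[findInterval(x, breaks, all.inside = TRUE)]

val2col(c(-3, 0, 3))   # "#00FFFF" "#000000" "#A020F0"
val2col(10)            # clamped to the last color, "#A020F0"
```

With an odd number of colors and symmetric breaks, the middle color (black) is guaranteed to sit on zero.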
The print and summary functions are called depending on the type of the object to print or summarize.
print and summary are called “generics”.
R implements two frameworks for OO programming, S3 and S4. Both are used in parallel.
class(starwars)
print(starwars)
print.data.frame(starwars)
tibble:::print.tbl(starwars)
Although a tibble is also a data frame, tbl comes before data.frame in the class vector, so it takes precedence and the object is displayed with the function from the tibble package.
When starwars is an object of class tbl and we call the function print on that object, R first looks for a function called print.tbl.
R finds print.tbl even though it is not attached to our namespace: it has been loaded with the tibble package, so R can see it.
v1 <- "blabla"
## add a class, not replace
class(v1) <- c("bulba", class(v1))
print.bulba <- function(x, ...) {
cat(paste0("An object of class bulba:\n", x, "\n"))
}
v1
## An object of class bulba:
## blabla
nonsense <- function(x, ...) { UseMethod("nonsense", x) }
nonsense.default <- function(x, ...) {
cat("Oh well, not a bulba then.\n")
}
nonsense.bulba <- function(x, ...) {
cat(paste("This", x, "is nonsense!\n"))
}
nonsense(v1)
## This blabla is nonsense!
nonsense(pi)
## Oh well, not a bulba then.
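Dispatch walks the class vector from left to right, which is also why a tibble prints as a tbl rather than as a data.frame. A self-contained sketch with hypothetical class names (`child`, `parent`) and a made-up generic `describe()`:

```r
# A made-up generic with a default method and a method for "parent"
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default"
describe.parent  <- function(x, ...) "parent"

obj <- structure(1, class = c("child", "parent"))

# No describe.child exists yet, so R falls through to describe.parent
first <- describe(obj)
first     # "parent"

# Once a method for the first class exists it takes precedence;
# NextMethod() hands over to the next class in the vector
describe.child <- function(x, ...) paste("child, then", NextMethod())
second <- describe(obj)
second    # "child, then parent"
```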
Remember that everything is a function?
We can define generic operators to work on our class!
v1 <- "a"
class(v1) <- c("bulba", class(v1))
v2 <- "b"
class(v2) <- c("bulba", class(v2))
`+.bulba` <- function(a, b) {
ret <- paste0(a, b)
class(ret) <- "bulba"
return(ret)
}
v1 + v2
## An object of class bulba:
## ab
g1 <- ggplot(data=mtcars, aes(x=disp, y=hp, color=mpg)) +
  geom_point(size=5) + scale_color_viridis_c()
class(g1)
## [1] "gg" "ggplot"
methods(class="gg")
## [1] +
## see '?methods' for accessing help and source code
methods(class="ggplot")
## [1] as_grob ggplot_build plot print summary
## see '?methods' for accessing help and source code
methods(print) %>% { .[grep("ggplot", .)] }
## [1] "print.ggplot" "print.ggplot2_bins"
qplot is an interface to ggplot which uses a syntax similar to the base R plot function.
The theme() and theme_*() functions return an object of the class theme which can be added to a ggplot in order to change appearance of several elements. The list of the elements you can theme can be found in the theme() help page. You can add themes. The result is again a theme object that you can reuse and even set as default. This makes it easy to create your own themes.
In base R:
plot(...., log="xy") # log-scale both axes
In ggplot2:
ggplot(data, aes(...)) + ... + scale_x_log10()
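A concrete base R version of the elided call, with synthetic data made up for this sketch. On log-log axes a power law such as y = x^2 becomes a straight line:

```r
# Synthetic data: x spans four orders of magnitude
x <- 10^seq(0, 4, length.out = 50)
y <- x^2

plot(x, y, log = "xy")   # both axes logarithmic
par("xlog")              # TRUE while this plot is the active one
```

Use `log = "x"` or `log = "y"` to transform only one axis.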
To avoid overlapping labels, we can use the ggrepel package.
(Demo)
There are two important functions in cowplot: the predefined theme_cowplot(), which is quite nice, and plot_grid(), which rocks. plot_grid() allows you to create separate plots and combine them in a number of ways. You can even draw a plot in base R, record it and include it in your plot_grid() call!
Note: cowplot defines its own theme, theme_cowplot(), and automatically sets it when loaded (I think that is no longer the case in the newest versions). It stays there even if you unload the package; however, you can always use theme_set() to set the default theme to something else.
facet_grid
You can get a lattice-like representation using the facet_grid() function. For example:
ggplot(mpg, aes(cty, hwy)) + geom_point() + facet_grid(rows=vars(cyl))
data(mpg)
ggplot(mpg, aes(cty, fill=factor(cyl))) +
geom_density(alpha=0.8) +
labs(title="Density plot",
subtitle="City Mileage Grouped by Number of cylinders",
caption="Source: mpg",
x="City Mileage",
fill="# Cylinders")
library(ggcorrplot)
# Correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)
# Plot
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of mtcars",
ggtheme=theme_bw)
# ```{r, animation.hook="gifski"}
# for (i in 1:2) {
# pie(c(i %% 2, 6), col = c('red', 'yellow'), labels = NA)
# }
# ```
“The world cannot be understood without numbers. But the world cannot be understood with numbers alone.”
― Hans Rosling, Factfulness: Ten Reasons We’re Wrong About the World—and Why Things Are Better Than You Think
library(ggplot2)
theme_set(theme_bw())
library(gapminder)
knitr::kable(head(gapminder))
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
gapminder %>% ggplot(aes(x=gdpPercap, y=lifeExp, color=year)) + geom_point()
gapminder %>% ggplot(aes(x=gdpPercap, y=lifeExp, color=year)) + geom_point() + scale_x_log10()
gapminder %>% ggplot(aes(x=gdpPercap, y=lifeExp, color=year)) + geom_point() + scale_x_log10() + scale_color_viridis_c()
gapminder %>% filter(year==2007) %>% ggplot(aes(x=gdpPercap, y=lifeExp, color=continent)) + geom_point() + scale_x_log10() + scale_color_brewer(palette="Dark2")
gapminder %>% filter(year==2007) %>% ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, color=continent)) + geom_point() + scale_x_log10() + scale_color_brewer(palette="Dark2")
gapminder %>% filter(year==2007) %>% ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, color=country)) + geom_point(alpha=.7, show.legend=FALSE) + scale_color_manual(values=country_colors) + scale_x_log10()
## set the x limits inside scale_x_log10(); a separate xlim() would be replaced by it
g1952 <- gapminder %>% filter(year == 1952) %>%
  ggplot(aes(x=gdpPercap, y=lifeExp, color=continent)) +
  geom_point() + scale_color_brewer(palette="Dark2") +
  scale_x_log10(limits=range(gapminder$gdpPercap)) +
  ylim(range(gapminder$lifeExp))
g2007 <- gapminder %>% filter(year == 2007) %>%
  ggplot(aes(x=gdpPercap, y=lifeExp, color=continent)) +
  geom_point() + scale_color_brewer(palette="Dark2") +
  scale_x_log10(limits=range(gapminder$gdpPercap)) +
  ylim(range(gapminder$lifeExp))
plot_grid(g1952, g2007)
Much easier!
gapminder %>% filter(year %in% c(1952, 2007)) %>% ggplot(aes(x=gdpPercap, y=lifeExp, color=continent)) + scale_color_brewer(palette="Dark2") + geom_point() + facet_grid(. ~ year) + scale_x_log10()
gapminder %>% filter(year %in% c(1952, 2007)) %>% ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, color=continent)) + scale_color_brewer(palette="Dark2") + geom_point() + facet_grid(. ~ year) + scale_x_log10()
tmp <- gapminder %>% filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarise(mean=mean(gdpPercap), median=median(gdpPercap))
tmp %>% ggplot(aes(x=year, y=mean, color=continent)) +
  geom_point() + geom_line() + scale_y_log10() +
  geom_label(aes(label=continent), hjust="outward", show.legend=FALSE) +
  xlim(1945, 2020)
gapminder %>% filter(year %in% c(1952, 2007) & continent=="Europe") %>% arrange(gdpPercap, year) %>% mutate(country=factor(country, levels=unique(country))) %>% ggplot(aes(x=gdpPercap, y=country, color=year)) + geom_point() + geom_line()
library(gganimate)
g <- gapminder %>% ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, color=continent)) +
geom_point(alpha=.8) +
scale_color_brewer(palette="Dark2") +
scale_x_log10() +
scale_size(range = c(2, 12)) +
transition_time(year) +
labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
ease_aes("linear")
animate(g, duration = 15, fps = 20, width = 800, height = 500, renderer = av_renderer())
anim_save("gapminder.mp4")
Warning: gganimate has huge installation requirements, because you need a renderer library. Depending on your system, this might take a lot of disk space and cause a lot of headache. For example, using the gifski package requires you to install the Rust environment. Including animations in R Markdown might also be problematic.